Proportion of Political Seats Held by Women: by Country & Year

Jennifer Smith

June 09, 2022

For this project, I found an interesting dataset on Kaggle that lists the proportion of political seats held by women in each country by year. I was curious to see what the distribution of women in power looks like globally, and whether it has changed over time. This is the dataset that I used: https://www.kaggle.com/datasets/mathurinache/women-in-power3?select=Viz5_August_Female_Political_Representation.csv

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

women = pd.read_csv('Viz5_August_Female_Political_Representation.csv')
women.head()
Out[1]:
Country Name Country Code Year Proportion of seats held by women in national parliaments (%)
0 Albania ALB 1997 NaN
1 Albania ALB 1998 NaN
2 Albania ALB 1999 0.051613
3 Albania ALB 2000 0.051613
4 Albania ALB 2001 0.057143

Above is a snapshot of the dataset. Although the column name ends in (%), the values are stored as fractions between 0 and 1. The dataset is global, and while a few values are missing, I would treat it as a population rather than a sample.

I'm interested in the column holding the proportion of seats held by women in national parliaments: its distribution, any outliers, and how it changes over time. This should provide insight into the standing of women in power over time, and I'm hoping the statistical distribution will reveal some nuance in the overall trend.

First, though, I can see that I'll need to reshape the data, because the years are all in one column and I need each year in its own column. The result of my pivot is shown below.

In [2]:
by_year = women.pivot_table('Proportion of seats held by women in national parliaments (%)', index = 'Country Name', 
                              columns='Year')
by_year.head()
Out[2]:
Year 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Country Name
Albania NaN NaN 0.051613 0.051613 0.057143 0.057143 0.057143 0.064286 0.071429 0.071429 ... 0.164286 0.157143 0.157143 0.178571 0.200000 0.207143 0.228571 0.278571 0.278571 0.295082
Algeria 0.031579 0.031579 0.031579 0.034211 0.034211 0.061697 0.061697 0.061697 0.061697 0.061697 ... 0.077121 0.079692 0.316017 0.316017 0.316017 0.316017 0.316017 0.257576 0.257576 0.257576
Andorra 0.071429 0.071429 0.071429 0.071429 0.142857 0.142857 0.142857 0.142857 0.285714 0.285714 ... 0.357143 0.500000 0.500000 0.500000 0.500000 0.392857 0.321429 0.321429 0.321429 0.464286
Angola 0.095455 0.154545 0.154545 0.154545 0.154545 0.154545 0.154545 0.150000 0.150000 0.150000 ... 0.386364 0.381818 0.340909 0.340909 0.368182 0.368182 0.368182 0.304545 0.304545 0.300000
Antigua and Barbuda 0.052632 0.052632 NaN 0.052632 0.052632 0.052632 0.052632 0.105263 0.105263 0.105263 ... 0.105263 0.105263 0.105263 0.105263 0.111111 0.111111 0.111111 0.111111 0.111111 0.111111

5 rows × 23 columns
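
To make the reshape concrete, here is a minimal sketch on a made-up long-format table. The country labels, the column name 'share', and the values are placeholders, not rows from the Kaggle file. pivot_table turns each distinct Year into its own column and, by default, averages any duplicate country/year pairs.

```python
import pandas as pd

# Toy long-format data: one row per (country, year). Values are made up.
long_df = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Year': [1999, 2000, 1999, 2000],
    'share': [0.10, 0.12, 0.30, 0.35],
})

# One row per country, one column per year.
wide_df = long_df.pivot_table(values='share', index='Country Name', columns='Year')
print(wide_df)
```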

This is what I wanted, but I thought it was too granular, so I decided to keep one column every 5 years, ending at the most recent year (2019).

The table below describes the data, including the mean, standard deviation, min, Q1, median, Q3, and max. One thing that sticks out to me is that the standard deviation gets larger over time. This appears to be because the min stays at 0 while the max grows (from about 0.43 in 1999 to a peak of about 0.64 in 2014, before dipping to about 0.61 in 2019). A wider range naturally allows for more variability.

In [3]:
by_year[[1999,2004,2009,2014,2019]].describe()
Out[3]:
Year 1999 2004 2009 2014 2019
count 184.000000 213.000000 214.000000 211.000000 214.000000
mean 0.115095 0.146699 0.178698 0.205967 0.232317
std 0.081317 0.090338 0.100370 0.109409 0.113268
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.056393 0.090159 0.106012 0.132161 0.160500
50% 0.107625 0.132075 0.174006 0.195741 0.220057
75% 0.160685 0.192308 0.226250 0.260849 0.299257
max 0.426934 0.487500 0.562500 0.637500 0.612500
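
As a rough check on the spread, the interquartile range can be computed from the quartiles printed above. The numbers below are copied by hand from the describe() output, so this is a sketch rather than a recomputation from the raw file.

```python
import pandas as pd

# Quartiles and extremes copied from the describe() table above.
summary = pd.DataFrame(
    {
        1999: [0.000000, 0.056393, 0.160685, 0.426934],
        2004: [0.000000, 0.090159, 0.192308, 0.487500],
        2009: [0.000000, 0.106012, 0.226250, 0.562500],
        2014: [0.000000, 0.132161, 0.260849, 0.637500],
        2019: [0.000000, 0.160500, 0.299257, 0.612500],
    },
    index=['min', '25%', '75%', 'max'],
)

iqr = summary.loc['75%'] - summary.loc['25%']        # spread of the middle half
full_range = summary.loc['max'] - summary.loc['min']  # spread of the extremes

print(pd.DataFrame({'IQR': iqr, 'range': full_range}).round(4))
```

With these numbers the IQR widens from about 0.104 in 1999 to about 0.139 in 2019, so the middle of the distribution spreads out as well, not only the extremes.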

The statistical measures that describe() does not include are listed below.

In [4]:
selected = by_year[[1999, 2004, 2009, 2014, 2019]]

print('MODE')
print(selected.mode())
print('VARIANCE')
print(selected.var())
print('MEAN ABSOLUTE DEVIATION')
# DataFrame.mad() was removed in pandas 2.0; compute it directly instead
print((selected - selected.mean()).abs().mean())
print('RANGE')
proportion_range = selected.max() - selected.min()
print(proportion_range)
MODE
Year  1999  2004  2009  2014  2019
0      0.0   0.0   0.0   0.0  0.20
1      NaN   NaN   NaN   NaN  0.25
VARIANCE
Year
1999    0.006613
2004    0.008161
2009    0.010074
2014    0.011970
2019    0.012830
dtype: float64
MEAN ABSOLUTE DEVIATION
Year
1999    0.061107
2004    0.068598
2009    0.075702
2014    0.082533
2019    0.086442
dtype: float64
RANGE
Year
1999    0.426934
2004    0.487500
2009    0.562500
2014    0.637500
2019    0.612500
dtype: float64
In [5]:
import matplotlib.ticker as mtick

by_year[[1999,2004,2009,2014,2019]].boxplot(grid=False, figsize = (12,8), showmeans=True)

plt.ylabel('share of seats held by women', fontsize=14)
plt.title('Proportion of political seats held by women globally', fontsize=16)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))

plt.show()

After reviewing my boxplot, I noticed that the upper bound grew over time, and the number of outliers shrank. I decided to look at which countries remained outliers in 2019 (below).

In [6]:
q1_2019 = by_year[2019].quantile(0.25)
q3_2019 = by_year[2019].quantile(0.75)

# Interquartile range and the 1.5 * IQR fences used for the boxplot whiskers
IQR2019 = q3_2019 - q1_2019
print('IQR 2019:', IQR2019)

upper_bound = q3_2019 + (1.5 * IQR2019)
print('Upper Bound 2019:', upper_bound)

lower_bound = q1_2019 - (1.5 * IQR2019)
print('Lower Bound 2019:', lower_bound)
print()

outliers_in_2019 = by_year[2019][by_year[2019] > upper_bound]
print('Outliers in 2019:') 
print(outliers_in_2019)
IQR 2019: 0.13875742574249997
Upper Bound 2019: 0.5073935643562499
Lower Bound 2019: -0.04763613861374996

Outliers in 2019:
Country Name
Bolivia    0.530769
Cuba       0.532231
Rwanda     0.612500
Name: 2019, dtype: float64
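
The same 1.5 × IQR rule can be wrapped in a small helper and applied to every year at once, which is one way to check how the number of high outliers changes over time. The function name count_high_outliers and the toy data are hypothetical. With the real data, the by_year selection would be passed in instead.

```python
import pandas as pd

def count_high_outliers(df):
    """Count values above Q3 + 1.5 * IQR in each column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    upper = q3 + 1.5 * (q3 - q1)
    # Comparisons against NaN evaluate to False, so missing values are never counted.
    return (df > upper).sum()

# Made-up example: column 'x' has one extreme value, column 'y' has none.
toy = pd.DataFrame({
    'x': [0.10, 0.12, 0.14, 0.16, 0.90],
    'y': [0.10, 0.12, 0.14, 0.16, 0.18],
})
counts = count_high_outliers(toy)
print(counts)
```
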
In [7]:
by_year[[1999,2004, 2009,  2014, 2019]].hist(grid=False, color='thistle', edgecolor='slategrey', figsize = (12,11), 
                                         bins=[0,.05,.10,.15,.20,.25,.30,.35,.40,.45,.50,.55,.60,.65])

plt.show()

After trying the histogram with all 5 years I'd included in my boxplot, I noticed that the middle three don't change that much and generally all move in the direction of the final one, so I decided it might be more impactful to only look at the first and last to see the change that happened in 20 years. That is below.

In [8]:
# DataFrame.hist() returns an array of Axes (one row, two columns here), not a Figure
axes = by_year[[1999,2019]].hist(grid=False, color='thistle', edgecolor='slategrey', figsize = (14,6), 
                                         bins=[0,.05,.10,.15,.20,.25,.30,.35,.40,.45,.50,.55,.60,.65])

# Apply the same labels and percent formatting to both panels
for ax, year in zip(axes[0], [1999, 2019]):
    ax.set_xlabel('share of seats held by women', fontsize=12)
    ax.set_ylabel('count of countries', fontsize=12)
    ax.set_title(f'{year} global women in power', fontsize=14)
    ax.xaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))

plt.show()

Takeaways

i. Which measure of center is the most appropriate description of center for your data? Explain why.

The mean and median are fairly close in this dataset, which you can see on the boxplot (the green lines are the medians, and the green triangles are the means). I would ultimately choose the median as most appropriate, because there are outliers that do pull the mean up a bit each year. The mode is not appropriate, because in many years it is 0, which is not representative of the global picture.

ii. Did your variable contain outliers (detected using the IQR)? What were the upper and lower bounds for your whiskers?

My variable did contain outliers, all above the upper bound. In every year shown, the calculated lower bound is negative, which doesn't make sense for a proportion, so no value can fall below it and the lower whisker stops at the minimum of 0.
In 2019, the upper bound for the whiskers was 50.74%, and the calculated lower bound was -4.76%, so the effective lower bound was 0%.

iii. What kind of distribution did your data have (unimodal, skewed left, etc.)?

In 1999, the data had a right-skewed distribution, with most countries bunched at low shares. In 2019 it was unimodal with a more central peak, though still slightly right-skewed.
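
One rough way to put a number on the skew is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, which is positive when the mean sits above the median. The sketch below uses values copied from the describe() table. It is a swapped-in summary measure, not something computed in the analysis above, and the helper name is hypothetical.

```python
# Values copied from the describe() output for the first and last years.
stats = {
    1999: {'mean': 0.115095, 'median': 0.107625, 'std': 0.081317},
    2019: {'mean': 0.232317, 'median': 0.220057, 'std': 0.113268},
}

def pearson_median_skew(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

skew = {year: pearson_median_skew(**s) for year, s in stats.items()}
for year, value in skew.items():
    print(year, round(value, 3))
```

Both values come out positive, consistent with a tail toward higher shares in both years. With the raw data, by_year[2019].skew() would give the moment-based skewness instead.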

iv. Are there any other interesting things your analysis shows?

When I looked at the details of the outliers in 2019, I was quite surprised to see that they were Bolivia, Cuba, and Rwanda with the highest share of women in political seats. I would have expected larger, wealthier countries to have the highest shares.

I was pleased to see progress with more women holding political seats over time, but there's still a fair amount of progress to be made, as even in 2019 the largest bins on the histogram are between 10% and 25%.

I also noticed on the boxplot that Q3 grew more over time than Q1 (about 14 percentage points versus about 10 between 1999 and 2019), which suggests that some countries are making very little progress. It may be that countries that already had more female political representation are more likely to increase it further, though the quartiles alone cannot confirm that.
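
The quartiles do not track individual countries, so one way to probe this idea would be to compare each country's starting share with its change over the period. The sketch below uses made-up numbers and a hypothetical helper name. With the real data, by_year would take the place of toy.

```python
import pandas as pd

def change_vs_start(wide, start, end):
    """Return each country's starting level and its change, dropping rows with missing data."""
    pair = wide[[start, end]].dropna()
    return pd.DataFrame({
        'start': pair[start],
        'change': pair[end] - pair[start],
    })

# Made-up wide table: index = country, columns = years.
toy = pd.DataFrame(
    {1999: [0.05, 0.10, 0.20, None], 2019: [0.07, 0.20, 0.40, 0.30]},
    index=['A', 'B', 'C', 'D'],
)

result = change_vs_start(toy, 1999, 2019)
# A positive correlation would mean higher starting shares go with larger gains.
correlation = result['start'].corr(result['change'])
print(result)
print(correlation)
```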
